The present document is divided in 3 parts:
The dataset used for this assignment contains data of several hundreds wine reviews, with information about the country of production and the unit cost. The dataset contains both quantitative and qualitative variables.
The dataset was downloaded from the website kaggle.
In order to import the dataset in RStudio, I used the read.csv function. Then, I renamed it “DATA”.
#load the dataset
read.csv("winemag-data_first150k.csv", header = T, sep = ",")
#Name the dataset
DATA <- read.csv("winemag-data_first150k.csv", header = T, sep = ",")
The first step to analyze this dataset is to show its structure and give some high-level information regarding it. Using the dim function we know that the dataset contains 150,930 observations and 11 variables. The ls function shows the column’s headers. The head function allows to explore the first 6 rows of the dataset. Finally the str function gives more info for each variable.
#more info about the dataset
dim(DATA)
## [1] 150930 11
ls(DATA)
## [1] "country" "description" "designation" "points" "price"
## [6] "province" "region_1" "region_2" "variety" "winery"
## [11] "X"
head(DATA)
str(DATA)
## 'data.frame': 150930 obs. of 11 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ country : chr "US" "Spain" "US" "US" ...
## $ description: chr "This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry f"| __truncated__ "Ripe aromas of fig, blackberry and cassis are softened and sweetened by a slathering of oaky chocolate and vani"| __truncated__ "Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious, balanced and com"| __truncated__ "This spent 20 months in 30% new French oak, and incorporates fruit from Ponzi's Aurora, Abetina and Madrona vin"| __truncated__ ...
## $ designation: chr "Martha's Vineyard" "Carodorum Selección Especial Reserva" "Special Selected Late Harvest" "Reserve" ...
## $ points : int 96 96 96 96 95 95 95 95 95 95 ...
## $ price : num 235 110 90 65 66 73 65 110 65 60 ...
## $ province : chr "California" "Northern Spain" "California" "Oregon" ...
## $ region_1 : chr "Napa Valley" "Toro" "Knights Valley" "Willamette Valley" ...
## $ region_2 : chr "Napa" "" "Sonoma" "Willamette Valley" ...
## $ variety : chr "Cabernet Sauvignon" "Tinta de Toro" "Sauvignon Blanc" "Pinot Noir" ...
## $ winery : chr "Heitz" "Bodega Carmen RodrÃguez" "Macauley" "Ponzi" ...
Because I run some statistics, I wanted to verify whether and how many null values the dataset contains.
#control of NA values
any(is.na(DATA))
## [1] TRUE
sum(is.na(DATA))
## [1] 13695
colSums(is.na(DATA))
## X country description designation points price
## 0 0 0 0 0 13695
## province region_1 region_2 variety winery
## 0 0 0 0 0
From this control, I realized that the dataset contains 13,695 null values and that all of them are in the Price column. Knowing this, I decided to work with a version of the dataset that does not contain rows with null values. I called this smaller version of it “DATA.clean”. This modified dataset contains 137,235 rows, 13,695 rows less than the original dataset. I could have used a function that allowed me to exclude only the rows that contain null values for the Price variable, but in this particular case all the null values are on the Price column, so it makes no difference.
DATA.clean <- na.omit(DATA)
any(is.na(DATA.clean))
## [1] FALSE
The next step is to run some summary statistics of the main variables. For example I found that the average price for a wine bottle is 33.13 USD, the cheapest wine is priced at 4 USD, the most expensive bottle costs 2,300 USD, and the quantiles are respectively 4, 16, 24, 40, and 2,300 USD.
#summary of price
summary(DATA.clean$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 16.00 24.00 33.13 40.00 2300.00
median(DATA.clean$price)
## [1] 24
quantile(DATA.clean$price)
## 0% 25% 50% 75% 100%
## 4 16 24 40 2300
min(DATA.clean$price)
## [1] 4
max(DATA.clean$price)
## [1] 2300
#summary of location
summary(DATA.clean$country)
## Length Class Mode
## 137235 character character
#summary of review points
summary(DATA.clean$points)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 80.00 86.00 88.00 87.79 90.00 100.00
#summary of variety
summary(DATA.clean$variety)
## Length Class Mode
## 137235 character character
Then, I focused on the Country variable and I created a table that shows the distribution of the reviewed wines between the countries.
#focus on country variable
table(DATA.clean$country)
##
## Albania Argentina
## 5 2 5587
## Australia Austria Bosnia and Herzegovina
## 4894 2483 4
## Brazil Bulgaria Canada
## 25 77 194
## Chile China Croatia
## 5766 3 83
## Cyprus Czech Republic England
## 31 6 8
## France Georgia Germany
## 14785 43 2347
## Greece Hungary India
## 872 230 8
## Israel Italy Japan
## 610 18784 2
## Lebanon Lithuania Luxembourg
## 37 8 9
## Macedonia Mexico Moldova
## 16 63 71
## Montenegro Morocco New Zealand
## 2 12 3070
## Portugal Romania Serbia
## 4176 139 14
## Slovakia Slovenia South Africa
## 3 81 2237
## South Korea Spain Switzerland
## 4 8160 4
## Turkey Ukraine Uruguay
## 50 5 85
## US US-France
## 62139 1
producers <- table(DATA.clean$country)
table(DATA.clean$country)/nrow(DATA.clean)
##
## Albania Argentina
## 3.643385e-05 1.457354e-05 4.071119e-02
## Australia Austria Bosnia and Herzegovina
## 3.566146e-02 1.809305e-02 2.914708e-05
## Brazil Bulgaria Canada
## 1.821693e-04 5.610814e-04 1.413634e-03
## Chile China Croatia
## 4.201552e-02 2.186031e-05 6.048020e-04
## Cyprus Czech Republic England
## 2.258899e-04 4.372063e-05 5.829417e-05
## France Georgia Germany
## 1.077349e-01 3.133311e-04 1.710205e-02
## Greece Hungary India
## 6.354064e-03 1.675957e-03 5.829417e-05
## Israel Italy Japan
## 4.444930e-03 1.368747e-01 1.457354e-05
## Lebanon Lithuania Luxembourg
## 2.696105e-04 5.829417e-05 6.558094e-05
## Macedonia Mexico Moldova
## 1.165883e-04 4.590666e-04 5.173607e-04
## Montenegro Morocco New Zealand
## 1.457354e-05 8.744125e-05 2.237039e-02
## Portugal Romania Serbia
## 3.042956e-02 1.012861e-03 1.020148e-04
## Slovakia Slovenia South Africa
## 2.186031e-05 5.902284e-04 1.630051e-02
## South Korea Spain Switzerland
## 2.914708e-05 5.946005e-02 2.914708e-05
## Turkey Ukraine Uruguay
## 3.643385e-04 3.643385e-05 6.193755e-04
## US US-France
## 4.527927e-01 7.286771e-06
perc.producers <- table(DATA.clean$country)/nrow(DATA.clean)
I decided to transform the variable Points using log10 function.
summary(DATA.clean$points)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 80.00 86.00 88.00 87.79 90.00 100.00
summary(log10(DATA.clean$points))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.903 1.934 1.944 1.943 1.954 2.000
For the final step, I plotted a bar chart of the countries distribution to see which countries are the most important producers of wine among the sample contained in the dataset.